Ambric Discloses Massively Parallel Architecture

On August 21st, fabless semiconductor startup Ambric unveiled its massively parallel processor architecture. Ambric joins a host of startups pursuing a similar idea: chaining together a large number of simple RISC-like processor cores in ways intended to avoid the inter-processor communication bottlenecks and programming problems found in traditional multiprocessor systems. In the Ambric architecture, individual processors can run at different clock speeds and operate asynchronously relative to each other. Asynchronous hardware channels coordinate communication between processors at run time, avoiding the need for a global synchronization scheme.

“At Ambric, we believed the key to a practical solution for massively parallel embedded computing was the unrelenting focus on first developing the right programming model. After developing that model, we then invented new hardware architectures and circuit designs to enable it,” explains Mike Butts, senior IC architect at Ambric. As illustrated in Figure 1, to implement an application, the developer decomposes it into a hierarchical structure of basic tasks, or “objects” in Ambric’s nomenclature, and specifies the data and control messages each object sends and receives. Objects are then implemented using a conventional compiler and debugged using a functional simulator. Once the objects have been implemented and verified, a realization tool chain maps them onto processors and memories, routes the inter-processor communications, and creates a configuration file for the chip. The chip is configured at start-up, much like an FPGA, with the configuration file read from flash or supplied by an external host. Once the chip is configured, real-time debugging is enabled by using otherwise unused processors and memories to monitor channel traffic.

Figure 1. Example hierarchical grouping of Ambric “objects”
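
The object-and-channel model described above can be approximated in ordinary software. The following is a minimal sketch, not Ambric's tooling or language (which this article does not detail): it assumes Python threads stand in for objects and bounded queues stand in for hardware channels. All names (`source`, `scale`, `accumulate`, `run_pipeline`, `END`) are hypothetical.

```python
import queue
import threading

END = object()  # end-of-stream marker; an assumption of this sketch


def source(out_ch, values):
    """Leaf object: emits a stream of values on its output channel."""
    for v in values:
        out_ch.put(v)          # blocks while the channel is full
    out_ch.put(END)


def scale(in_ch, out_ch, factor):
    """Leaf object: multiplies each incoming value by a constant."""
    while True:
        v = in_ch.get()        # blocks until data arrives
        if v is END:
            out_ch.put(END)
            return
        out_ch.put(v * factor)


def accumulate(in_ch, result):
    """Leaf object: sums its input stream and records the total."""
    total = 0
    while True:
        v = in_ch.get()
        if v is END:
            result.append(total)
            return
        total += v


def run_pipeline(values, factor):
    """Composite object: wires three leaf objects together with channels."""
    a = queue.Queue(maxsize=1)  # bounded channel: a sender blocks while it is full
    b = queue.Queue(maxsize=1)
    result = []
    workers = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=accumulate, args=(b, result)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return result[0]
```

Calling `run_pipeline([1, 2, 3, 4], 10)` returns 100. No object refers to a shared clock or a global barrier: each one simply waits on its own channels, which is the property the architecture is meant to provide in hardware.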

There are two types of CPUs in the Ambric architecture, called SRs and SRDs. The SRD CPU is a 32-bit RISC processor with DSP extensions that is used for math-intensive operations. The SR CPU is a simpler 32-bit RISC CPU used for utility tasks such as generating address streams and coordinating hardware channels. Two SRD CPUs and two SR CPUs are connected to form a larger block called a Compute Unit (CU), as shown in Figure 2. On-chip memories are encapsulated in RAM Units (RUs), each of which has four 1-kbyte RAM banks and processing engines that enable communication with other RUs and CPUs. Each CU-RU pair has its own clock domain and can be operated at a unique clock rate as required by the tasks running on it. Two CUs and two RUs form the basic architectural building block, called a “bric”; brics are replicated in a rectangular array across the chip.

Figure 2. Compute Unit (CU) and RAM Unit (RU). Two CUs and two RUs are connected to make a bric.

The Ambric architecture will face competition from a number of other massively parallel multiprocessor architectures from companies such as Connex, Tilera, and Telairity. These architectures are generally grouped into two categories: single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). For example, the architecture used in the Connex CA1024 processor, featured in last month’s Inside DSP, is SIMD. Each processor in the CA1024’s linear array of 1024 simple RISC-like processors must execute the same instruction as all the other processors or none at all. The Ambric architecture is fully MIMD, with each processing element executing instructions independently. Ambric’s stance is that while SIMD architectures are very fast, they are too narrowly focused and appropriate only for a very limited number of applications. Ambric believes that its architecture’s increased flexibility allows for high performance on a broader range of applications. Vendors with SIMD architectures, on the other hand, see these limitations as strengths. Connex, for example, believes that by focusing on high-definition digital video encoding and decoding in its CA1024 chip, it can offer an HD video solution with better performance than can be achieved with MIMD processors.
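
The difference between the two categories can be shown with a toy model. This sketch is only an illustration of the control-flow distinction and says nothing about the performance of either vendor's hardware; the function names and the `active` mask are hypothetical.

```python
def simd_step(instruction, lanes, active):
    """SIMD: one instruction is broadcast; each lane executes it or sits idle."""
    return [instruction(x) if on else x for x, on in zip(lanes, active)]


def mimd_step(programs, data):
    """MIMD: each processing element applies its own instruction to its own data."""
    return [prog(x) for prog, x in zip(programs, data)]


def double(x):
    return x * 2


def negate(x):
    return -x


def increment(x):
    return x + 1


# SIMD: all four lanes see "double"; lane 1 is masked off and does nothing.
simd_result = simd_step(double, [1, 2, 3, 4], [True, False, True, True])

# MIMD: four elements run four independently chosen instructions.
mimd_result = mimd_step([double, negate, increment, double], [1, 2, 3, 4])
```

Here `simd_result` is `[2, 2, 6, 8]` and `mimd_result` is `[2, -2, 4, 8]`. In the SIMD case, performing a different operation on lane 1 would require a second step with a different mask, which is the flexibility cost Ambric points to.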

One crucial factor determining Ambric’s success or failure will be the programming model. While massively parallel processors offer the possibility of dramatic performance gains over traditional architectures, these gains will only be realized if the programming model is user-friendly and efficient. Ambric’s programming model, while appealing in many ways, is still untested and raises many questions. For one, in demanding applications such as H.264 decoders, complex data dependencies will make mapping objects to processors a non-trivial task.

Ambric has tested its architecture in a prototype standard-cell ASIC. The test chip core contains 45 brics in a five-by-nine array. This translates to 360 32-bit processors and 4.6 Mbits of distributed SRAM. The processor operates at up to 333 MHz. Ambric’s recent disclosure did not include an announcement of any specific products; a product announcement is expected in October 2006.
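
The processor count follows directly from the bric composition described earlier. The short calculation below is a sketch that assumes "1 kbyte" means 1,024 bytes; the constant and function names are hypothetical.

```python
CPUS_PER_CU = 2 + 2      # two SRD and two SR CPUs per Compute Unit
CUS_PER_BRIC = 2
RUS_PER_BRIC = 2
BANKS_PER_RU = 4
BYTES_PER_BANK = 1024    # assumption: "1 kbyte" = 1,024 bytes


def chip_totals(rows, cols):
    """Return (brics, CPUs, bits of RU bank memory) for a rows-by-cols bric array."""
    brics = rows * cols
    cpus = brics * CUS_PER_BRIC * CPUS_PER_CU
    ru_bank_bits = brics * RUS_PER_BRIC * BANKS_PER_RU * BYTES_PER_BANK * 8
    return brics, cpus, ru_bank_bits


brics, cpus, ru_bank_bits = chip_totals(5, 9)
```

For the five-by-nine test chip this gives 45 brics and 360 CPUs, matching the figures above. The RU banks alone come to 2,949,120 bits, or roughly 2.9 Mbits, so the 4.6 Mbits total presumably includes memory that this article does not itemize.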
